Artificial Intelligence in Medicine
Elsevier BV
Preprints posted in the last 90 days, ranked by how well they match Artificial Intelligence in Medicine's content profile, based on 15 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.
Romagnoli, F.; Pellegrini, M.
Background: The ideal of personalized medicine is to support the clinical decision process towards the right drug for the right patient at the right time, by using, among other diagnostic tools, molecular biomarkers that are specifically dependent on the patient status and on the therapeutic options. Several challenges must be overcome to realize this vision. Patients present a wide spectrum of genetic variability even before developing diseases, and diseases like cancer add an extra layer of mutations, while only a very small fraction of such variants have diagnostic or prognostic value. Moreover, it is also challenging to predict how the patient will respond to a specific drug based on the patient's omic profiling, since any drug introduces further perturbations in the biochemical model. Methods: In this paper we propose the method Personalized-DrugRank for joint prediction of therapy response and time-to-response for cancer patients undergoing pharmacological therapy after surgery. The method is based on personalizing the DrugMerge methodology for drug repositioning in order to extract a few synthetic indices useful as input to ML prediction tools. In particular, the proposed methodology is a novel and principled approach to merging independent patient-specific transcriptomic data with drug perturbation data from cell line assays. One of the key novel features of our approach over the state of the art is the joint prediction of the response of the patient to therapy along with an estimate of the time-to-response (i.e., the prediction of the time needed for the therapy to succeed or fail). Findings: We tested our methodology on data from the TCGA (The Cancer Genome Atlas) Program for three cancer types (breast, stomach, and colorectal cancer), 10 pharmacological regimens and 13 homogeneous cohorts.
For the therapy response prediction task we developed models that attain an average AUC of 0.749, an average p-value of 0.030, and an average accuracy of 0.809, with balanced positive and negative predictive values. For the time-to-event prediction task we developed regression models for the 13 homogeneous cohorts that attain an average (geometric) Concordance Index of 0.782 (max 0.904, min 0.651) with an average log-likelihood p-value of 0.004, improving in nine of the 13 cohorts upon models based only on clinical parameters, which have an average Concordance Index of 0.678 and an average p-value of 0.006. Interestingly, we attain statistically significant results even with quite small therapy-homogeneous cohorts (ranging from a minimum of 7 patients to a maximum of 32). Conclusions: The ability to predict with high accuracy the response of a cancer patient to a chosen pharmacological therapeutic regimen, along with an estimate of the time-to-response, helps adapt the clinical decision process to the specific patient profile, thus increasing the likelihood of providing correct and timely therapeutic decisions.
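As a rough illustration of the Concordance Index reported above, the following sketch computes a Harrell-style C-index on toy data. The abstract does not say which estimator or software the authors used, so the function name `concordance_index`, the tie handling, and the toy values are all assumptions made for illustration only.

```python
def concordance_index(times, events, risk_scores):
    """Harrell-style C-index: share of comparable pairs ranked correctly.

    times       : observed follow-up time per patient
    events      : 1 if the event was observed, 0 if censored
    risk_scores : higher score means the model expects an earlier event
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's recorded time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (hypothetical values, not from the study).
times = [5, 8, 12, 20]
events = [1, 0, 1, 1]
risk = [0.9, 0.4, 0.1, 0.6]
print(concordance_index(times, events, risk))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the 0.782 and 0.678 averages above can be read.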
Jayme, A.; Heuveline, V.
Background and Objective: Glioblastoma outcome prediction remains difficult because clinically relevant signals are distributed across heterogeneous imaging and genomic modalities, cohorts are small, and conventional neural predictors do not quantify their own uncertainty. This study evaluates a hybrid neural-Bayesian belief network framework for uncertainty-aware multimodal glioblastoma prediction and examines how modality selection, model family, and structure-aware regularization affect predictive performance and confidence quality. Methods: The framework was evaluated on the TCGA-GBM radiogenomic cohort using four input modalities (T1Gd, FLAIR, mRNA, and CNA), five model families, five structural-weight settings, and 15 view subsets. A secondary benchmark on the UCI Human Activity Recognition dataset was included to assess whether observed limitations were specific to the glioblastoma setting. Results: CNA features consistently reduced performance in most multimodal settings, and selective fusion excluding CNA outperformed both the full four-view baseline and imaging-only alternatives. Model families showed clear differences in uncertainty behaviour: non-Bayesian families achieved the strongest predictive accuracy, whereas the Bayesian family achieved the lowest calibration error over a narrower confidence range. Bayesian belief network regularization produced consistent directional improvements without supporting reliable structure-discovery claims, as learned graph structures were not reproducible across folds. On the secondary benchmark, the same framework achieved much higher predictive performance, indicating that the glioblastoma performance ceiling primarily reflects data limitations rather than an architectural constraint. Conclusions: In small-sample radiogenomic prediction, modality choice is at least as important as model choice, and uncertainty quality differs substantially across uncertainty-aware model families.
The proposed framework provides a practical basis for comparing accuracy, calibration, modality selection, and structure-aware regularization in multimodal biomedical prediction.
Dahlberg, A. C. H.; Tapiola, O.; Luisto, R.; Puranen, T.; Sanmark, E.; Vartiainen, V.
Background: Embedding models are an integral part of generative AI architectures, transforming text into embedding vectors that represent semantic content in numerical form. Despite their central role, their performance in clinical settings remains underexplored. We evaluate embedding models across two tasks: semantic difference detection in clinical texts, and data retrieval from patient records. Methods: Eight models were applied to synthetic discharge summaries in English, Finnish, and Swedish. Semantic sensitivity was assessed by introducing controlled perturbations (deletion, modification, and paraphrasing) at three levels of severity; cosine similarity, and L1 and Euclidean distances were computed between the vectors of the original and perturbed texts. Partial vectors were compared to explore dimensionality reduction. Two models with the biggest contrast in semantic difference detection were evaluated on retrieval of relevant information from real Finnish vascular surgery records. Results: Embedding vectors captured semantic differences in clinical text: content deletion and modification produced larger increases in vector distance than paraphrasing. On average, models detected the direction of semantic change correctly, but case-level performance varied considerably. Qwen3-Embedding-8B was the only model with zero directional errors, while multilingual-E5-large erred in 13.8% of cases. In data retrieval, Qwen3-Embedding-8B again outperformed multilingual-E5-large, though the margin was narrower: sufficiency scores were 3.25 vs. 3.17 out of 5 for the first query and 2.25 vs. 1.15 out of 5 for the second query. For some models, as few as 0.6-1.2% of dimensions sufficed to replicate full-vector accuracy; principal component analysis and coordinate-level analysis did not account for this finding. 
Conclusions: Our results show that the choice of embedding model is important: performance differences between models can be large enough to determine whether clinically relevant information reaches the end user, and model weaknesses can be both task-specific and context-dependent.
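The three vector comparisons named in the Methods (cosine similarity, L1 distance, and Euclidean distance) can be sketched as follows. The three-dimensional toy vectors stand in for real embeddings and are purely hypothetical; the abstract does not describe the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def l1_distance(u, v):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean_distance(u, v):
    """Straight-line (L2) distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical embeddings of an original text and two perturbed versions.
original = [1.0, 0.0, 2.0]
paraphrased = [1.0, 0.1, 2.0]   # small change: meaning preserved
deleted = [0.0, 1.0, 0.5]       # large change: content removed

for name, vec in [("paraphrased", paraphrased), ("deleted", deleted)]:
    print(name,
          cosine_similarity(original, vec),
          l1_distance(original, vec),
          euclidean_distance(original, vec))
```

In this toy setup the "deleted" vector sits farther from the original than the "paraphrased" one under all three measures, which mirrors the direction of change the study checks for.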
Pascual, N.; Fernandez-Pichel, M.; Losada, D. E.; Garcia-Orosa, B.; Gude, F.; Costa Lathan, C.; Sueiro Justel, J.; Gomez Fontenla, A.; Lastra Perez, M.; Alonso Garcia, F.
Since the release of the first ChatGPT model in 2022, large language models (LLMs) have evolved significantly, and an increasing number of users now turn to these generative information systems for inquiries as sensitive and consequential as those related to health. The primary objective is to identify the main strengths and weaknesses of generative AI systems when responding to information needs as critical as those arising in the health domain. The study was structured using a question-answer format, in which each question corresponded to a user query and each answer represented the output generated by a model in response. The study employed a human evaluation framework involving two distinct panels of clinical experts from different specialties. The evaluation criteria encompassed three dimensions: adherence to medical consensus; presence or absence of inappropriate or incorrect information; and the potential to cause harm to users. GPT-4o mini, Llama 3, and MedLlama 3 were selected as three representative systems for the experiments. This study presents a detailed analysis of the performance of widely used contemporary large language models in addressing common health-related queries posed by online users. The results reinforce the potential of LLMs as tools for online health information seeking among non-expert users. However, the performance limitations identified underscore the need for further studies to monitor the future development of these models. Among them, performance issues have been identified in areas where users may be more vulnerable, leading to the retrieval of clinically incorrect information, particularly in matters relating to rare diseases. Furthermore, it has been noted that these models can become trapped in obsolete medical knowledge due to continuous scientific progress.
Salome, P.; Knoll, M.; Walz, D.; Cogno, N.; Dedeoglu, A. S.; Qi, A. L.; Isakoff, S. J.; Abdollahi, A.; Jimenez, R. B.; Bitterman, D. S.; Paganetti, H.; Chamseddine, I.
Introduction: Manual data extraction from unstructured clinical notes is labor-intensive and impractical for large-scale clinical and research operations. Existing automated approaches typically require large language models, dedicated computational infrastructure, and/or task-specific fine-tuning that depends on curated data. The objective of this study is to enable accurate extraction with smaller locally deployed models using a disease-site specific pipeline and prompt configuration that are optimized and reusable. Materials/Methods: We developed OncoRAG, a four-phase pipeline that (1) generates feature-specific search terms via ontology enrichment, (2) constructs a clinical knowledge graph from notes using biomedical named entity recognition, (3) retrieves relevant context using graph-diffusion reranking, and (4) extracts features via structured prompts. We ran OncoRAG using Microsoft Phi-3-medium-instruct (14B parameters), a mid-size language model deployed locally via Ollama. The pipeline was applied to three cohorts: triple-negative breast cancer (TNBC; n_patients=104, n_features=42; primary development), recurrent high-grade glioma (RiCi; n_patients=191, n_features=19; cross-lingual validation in German), and MIMIC-IV (n_patients=100, n_features=10; external testing). Downstream task utility was assessed by comparing survival models for 3-year progression-free survival built from automatically extracted versus manually curated features. Results: The pipeline achieved mean F1 scores of 0.80 ± 0.07 (TNBC; n_patients=44, n_features=42), 0.79 ± 0.12 (RiCi; n_patients=61, n_features=19), and 0.84 ± 0.06 (MIMIC-IV; n_patients=100, n_features=10) on test sets under the automatic configuration. Compared to direct LLM prompting and naive RAG baselines, OncoRAG improved the mean F1-score by 0.19 to 0.22 and 0.17 to 0.19, respectively. Manual configuration refinement further improved the F1-score to 0.83 (TNBC) and 0.81 (RiCi), with no change in MIMIC-IV.
Extraction time averaged 1.7-1.9 seconds per feature with the 14B model. Substituting a smaller 3.8B model reduced extraction time by 57%, with a decrease in F1-score (0.03-0.10). For TNBC, the extraction time was reduced from approximately two weeks of manual abstraction to under 2.5 hours. In an exploratory survival analysis, models using automatically extracted features showed a comparable C-index to those with manual curation (0.77 vs 0.76; 12 events). Conclusions: OncoRAG, deployed locally using a mid-size language model, achieved accurate feature extraction from multilingual oncology notes without fine-tuning. It was validated against manual extraction for both retrieval accuracy and survival model development. This locally deployable approach, which requires no external data sharing, addresses a critical bottleneck in scalable oncology research.
Rey-Blanes, A.; Veredas-Morente, J.; Vivas-Vargas, E.; Gil-Garcia, F.; Moreno-Barea, F. J.; Veredas, F. J.
Background and Objective: Access to real-world electronic health records (EHRs) remains limited by privacy, governance and annotation constraints, hindering the development of clinical natural language processing models. Realistic synthetic progress notes may provide EHR-like corpora that preserve clinically rigorous information on diagnoses, treatments, symptoms, imaging, laboratory findings and therapeutic trajectories without relying directly on sensitive patient records. This study evaluates whether large language models (LLMs) can generate realistic Spanish prostate cancer progress notes from published case reports, preserving clinical content, temporality and hospital-style conventions.
Liu, X.; Garg, M.; Jeon, E.; Jia, H.; Sauver, J. S.; Pagali, S. R.; Sohn, S.
Clinical narrative text contains crucial patient information, yet reliable extraction remains challenging due to linguistic variability, documentation habits, and differences across care settings. Large language models (LLMs) have shown strong accuracy on clinical information extraction (IE), but their reproducibility (stability under repeated runs) and robustness (stability under small, natural prompt variations) are less consistently quantified, despite being central to clinical deployment. In this study, we evaluate three open-weight LLMs representing distinct modeling choices: a dense general-purpose model (Llama 3.3), a mixture-of-experts (MoE) general-purpose model (Llama 4), and a domain-tuned medical model (MedGemma). We focus on binary clinical IE aligned with four mobility classes from the International Classification of Functioning, Disability and Health (ICF) framework. Using a controlled experimental design, we quantify (1) intra-prompt reproducibility across repeated sampling and (2) inter-prompt robustness across paraphrased prompts. We jointly report predictive performance (F1-score) and stability (Fleiss' kappa, κ), and we test factor effects using three-way ANOVA with post-hoc comparisons. Results show that increasing temperature generally degrades agreement, but the magnitude depends on model and task; furthermore, prompt paraphrasing can substantially reduce stability, with particularly large drops for the MoE model. Finally, we evaluate a practical mitigation, self-consistency via majority voting, which improves κ substantially and often improves or preserves F1-score, at the cost of additional inference. Together, these findings provide a reproducible framework and concrete recommendations for evaluating and improving LLM reliability in clinical IE.
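To make the stability metric and the mitigation concrete, here is a minimal sketch of Fleiss' kappa over repeated runs together with majority-vote aggregation. The run labels are invented, each repeated run is treated as one "rater", and the helper names `fleiss_kappa` and `majority_vote` are hypothetical; the study's actual evaluation code is not described in the abstract.

```python
from collections import Counter


def fleiss_kappa(count_table):
    """Fleiss' kappa for a table of per-item category counts.

    count_table[i][j] = number of raters assigning item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(count_table)
    n_raters = sum(count_table[0])
    n_categories = len(count_table[0])
    # Per-item observed agreement.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in count_table
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(row[j] for row in count_table) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_bar - p_e) / (1.0 - p_e)


def majority_vote(labels):
    """Most frequent label among repeated runs (self-consistency)."""
    return Counter(labels).most_common(1)[0][0]


# Five hypothetical repeated runs over four notes (binary labels).
runs = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
per_note = list(zip(*runs))  # labels grouped by note
table = [[labels.count(0), labels.count(1)] for labels in per_note]
print("kappa:", fleiss_kappa(table))
print("voted:", [majority_vote(labels) for labels in per_note])
```

Using an odd number of runs avoids ties in the binary vote; each extra run is one more model call per note, which is the inference cost the abstract refers to.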
Weyrich, J.; Dennstaedt, F.; Foerster, R.; Schroeder, C.; Aebersold, D. M.; Zwahlen, D. R.; Windisch, P.
Purpose: Large language models (LLMs) offer significant potential for automating the classification of clinical trials by eligibility criteria. However, a critical question remains regarding the optimal input data: while abstracts provide a condensed, high-density signal, full-text articles contain a much higher volume of information. It remains unclear whether the additional signal found in full texts improves classification performance or if the accompanying noise (in the form of thousands of words irrelevant to the question at hand in a complete manuscript) negatively affects the model's reasoning capabilities. Methods: GPT-5 was applied to classify 200 randomized controlled oncology trials from high-impact medical journals, labeling whether patients with localized and/or metastatic disease were eligible for inclusion. Each trial was classified twice - once using only the abstract and once using the full text - and GPT-5's outputs were compared with the ground-truth labels established by manual annotation. Performance was assessed by calculating and comparing accuracy, precision, recall, and F1 score, and the McNemar test was used to assess the statistical significance of the differences between the two input formats. Results: For identifying trials including patients with localized disease, GPT-5 achieved an accuracy of 86% (95% CI: 81% - 91%; F1 = 0.90) when using abstracts and 92% (95% CI: 88% - 95%; F1 = 0.92) when using full texts (p = 0.027). Performance for detecting trials that include patients with metastatic disease was comparably high, with accuracies of 99% (95% CI: 99% - 100%; F1 = 1.00) based on abstracts and 98% (95% CI: 97% - 100%; F1 = 0.99) based on full texts. Overall accuracy for assigning combined labels per trial increased from 86% (95% CI: 81% - 91%) using abstracts to 92% (95% CI: 88% - 95%) using full texts (p = 0.027).
Conclusion: Providing full-text articles to GPT-5 significantly improved the classification of trial eligibility criteria. These findings suggest that, for this task, the benefit of the additional signal contained within the full text outweighed the potential for performance degradation caused by increased noise. Utilizing full-text analysis appears particularly valuable for extracting specific eligibility criteria in oncology that are frequently omitted or not explicitly described within the abstract.
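The paired comparison described above can be illustrated with an exact (binomial) McNemar test on the discordant pairs. The abstract does not say whether the exact or the chi-square variant was used, and the counts below are hypothetical rather than taken from the study.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b : trials classified correctly from the abstract but not the full text
    c : trials classified correctly from the full text but not the abstract
    Only these discordant pairs carry information about the difference.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Under H0 each discordant pair falls either way with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts for illustration only.
print(mcnemar_exact(4, 16))
```

Trials on which both inputs agree (both right or both wrong) drop out of the statistic, which is why McNemar suits a design where each trial is classified twice.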
Cao, X.; Hou, J.; Wei, X.; Wang, Q.
We present a suite of foundational outcome-prediction models for critically ill patients, developed using readily available, routine blood tests and advanced machine learning techniques. The input data of the models includes complete blood counts (CBCs), metabolic panels, and additional biomarkers that assess liver and kidney function, coagulation status, and cardiac injury. The output yields the predicted outcome at a given future horizon. For diagnoses, the length of the future horizon is set to zero, while it is set to a fixed time interval for prognoses. The training dataset in this study comprises clinical data from 332 ICU patients, augmented with 200 synthetic samples generated via a conditional diffusion model. Generative machine learning-based data imputation and augmentation approaches yielded modest gains in predictive accuracy. However, substantial performance improvements were achieved through additional methods, including dimensionality and order reduction, SHAP-based feature importance analysis, and a novel time-series-to-image encoding strategy that enables the use of image-based classifiers for temporal clinical data. Principal component analysis-based order reduction produced measurable gains in outcome prediction, while the time-series-to-image encoding proved particularly effective in mitigating small-data limitations common in clinical research. Across all evaluation metrics (accuracy, precision, recall, F1 score, and AUROC), the prognostic models achieved performance exceeding 85%, with some models attaining AUROC scores above 90%. We also developed a new model-ensemble approach that improves the overall prediction, pushing all assessment metrics over 90%.
This work establishes a robust and interpretable AI-enabled diagnostic and prognostic toolkit for outcome predictions in critically ill patients and demonstrates a scalable workflow for developing high-performing models from sparse healthcare datasets. The proposed framework is readily deployable in ICU environments with routine blood testing capabilities and serves as a foundation for future integration into digital twin systems for critical care.
Dai, H.-J.; Mir, T. H.; Fang, L.-C.; Chen, C.-T.; Feng, H.-H.; Lai, J.-R.; Hsu, H.-C.; Nandy, P.; Panchal, O.; Liao, W.-H.; Tien, Y.-Z.; Chen, P.-Z.; Lin, Y.-R.; Jonnagaddala, J.
Accurate recognition and deidentification of sensitive health information (SHI) in spoken dialogues requires multimodal algorithms that can understand medical language and contextual nuance. However, errors in recognition and deidentification risk exposing SHI. Additionally, the variability and complexity of medical terminology, along with the inherent biases in medical datasets, further complicate this task. This study introduces the SREDH/AI-Cup 2025 Medical Speech Sensitive Information Recognition Challenge, which focuses on two tasks: Task 1, speech transcription, in which systems must accurately transcribe speech into text; and Task 2, medical speech de-identification, in which systems must detect and appropriately classify mentions of SHI. The competition attracted 246 teams; top-performing systems achieved a mixed error rate (MER) of 0.1147 and a macro F1-score of 0.7103, with average MER and macro F1-score of 0.3539 and 0.2696, respectively. Results were presented at the IW-DMRN workshop in 2025. Notably, the results reveal that LLMs were prevalent across both tasks: 97.5% of teams adopted LLMs for Task 1 and 100% for Task 2, highlighting their growing role in healthcare. Furthermore, we fine-tuned six models, demonstrating strong precision (~0.885-0.889) with slightly lower recall (~0.830-0.847), resulting in F1-scores of 0.857-0.867.
Weissenbacher, D.; Shabbir, M.; Campbell, I. M.; Berdahl, C. T.; Gonzalez-Hernandez, G.
Background: Large language models (LLMs) contain limited professional medical knowledge, as large-scale training on clinical text has not yet been possible due to restricted access. Objectives: To continue pre-training an open-access instruct LLM on de-identified medical notes and evaluate the resulting impact on real-world clinical decision-making tasks and standard benchmarks. Methods: Using 500K de-identified clinical notes from Cedars-Sinai Health System, we fine-tuned a Qwen3-4B Instruct model with supervised learning to generate medical decision-making (MDM) paragraphs from patient presentations, and evaluated it on assigned-diagnosis prediction, in-hospital cardiac-arrest mention detection, and a suite of general and biomedical benchmarks. Results: The fine-tuned model produced MDMs that closely resembled those written by physicians and outperformed the base-instruct model and larger clinically untrained models (Qwen3-32B and Llama-3.1-405B Instruct) on assigned-diagnosis prediction, the task most aligned with its training objective. On the task of detecting in-hospital cardiac arrest mentions, the model initially exhibited mild label collapse, but a brief task-specific fine-tuning stage resolved this issue and allowed it to surpass all competitors. The model also retained its general knowledge overall on biomedical and general-domain evaluation benchmarks compared to the baseline. Conclusion: Supervised full fine-tuning on clinical notes allowed the model to incorporate medical knowledge and to transfer this knowledge to unseen biomedical tasks without wholesale loss of general-domain abilities, while revealing collapse-related failure modes that motivate more principled strategies for clinical specialization.
Ray, P.
Thyroid carcinoma is one of the most prevalent endocrine malignancies worldwide, and accurate preoperative differentiation between benign and malignant thyroid nodules remains clinically challenging. Diagnostic methods that medical practitioners use at present depend on their personal judgment to evaluate both imaging results and separate clinical tests, which creates inconsistency that leads to incorrect medical evaluations. The combination of radiological imaging with clinical information systems enables healthcare providers to enhance their capacity to make reliable predictions about patient outcomes while improving their decision-making abilities. The study introduces a deep learning framework that utilizes multiple data sources by combining magnetic resonance imaging (MRI) data with clinical text to predict thyroid cancer. The system uses a Vision Transformer (ViT) to obtain advanced MRI scan features, while a domain-adapted language model processes clinical documents that contain patient medical history and symptoms and laboratory results. The cross-modal attention system enables the system to merge imaging data with textual information from different sources, which helps to identify how the two types of data are interconnected. The system uses a classification layer to classify the fused features, which allows it to determine the probability of cancerous tumors. The experimental results show that the proposed multimodal system achieves better results than the unimodal base systems because it has higher accuracy, sensitivity, specificity, and AUC values, which help medical personnel to make better preoperative decisions.
Corga Da Silva, R.; Romano, M.; Mendes, T.; Isidoro, M.; Ravichandran, S.; Kumar, S.; van der Heijden, M.; Fail, O.; Gnanapragasam, V. E.
Background: Clinical documentation and information retrieval consume over half of physicians' working hours, contributing to cognitive overload and burnout. While artificial intelligence offers a potential solution, concerns over hallucinations and source reliability have limited adoption at the point of care. Objective: To evaluate clinician-reported time savings, decision-making support, and satisfaction with DR. INFO, an agentic AI clinical assistant, in routine clinical practice. Methods: In this prospective, single-arm pilot study, 29 clinicians across multiple specialties in Portuguese healthcare institutions used DR. INFO v1.0 over five working days within a two-week period. Outcomes were assessed via daily Likert-scale evaluations and a final Net Promoter Score (NPS). Non-parametric methods were used throughout. Results: Clinicians reported high perceived time saving (mean 4.27/5; 95% CI: 3.97-4.57) and decision support (4.16/5; 95% CI: 3.86-4.45), with ratings stable across all study days and no evidence of attrition bias. The NPS was 81.2, with no detractors. Conclusions: Clinicians across specialties and career stages reported sustained satisfaction with DR. INFO for both time efficiency and clinical decision support. Validation in larger, controlled studies with objective outcome measures is warranted. Keywords: Medical AI assistant, LLMs in healthcare, Agentic AI, Clinical decision support, Point of care AI
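For readers unfamiliar with the Net Promoter Score, the sketch below uses the conventional definition on a 0-10 scale (promoters score 9-10, detractors 0-6). The abstract does not state the scale or cut-offs used in the study, so these, along with the sample responses, are assumptions.

```python
def net_promoter_score(scores):
    """NPS = percentage of promoters minus percentage of detractors.

    Assumes the conventional 0-10 'would you recommend' scale:
    promoters score 9 or 10, detractors score 0 through 6,
    and passives (7 or 8) count only in the denominator.
    """
    if not scores:
        raise ValueError("at least one response is required")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


# Hypothetical responses, not the study's data.
responses = [10, 9, 9, 8, 10, 7, 9, 10]
print(net_promoter_score(responses))
```

When there are no detractors, as reported above, the NPS reduces to the percentage of respondents who are promoters.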
Balaji, S.; Campbell, K.; Chen, R.-Z.; Smith, D. G.; Reyna, M. A.; Sarker, A.; Wallach, J. D.; Parikh, R. B.; Bozkurt, S.
Background: Identification of metastasis status in non-small cell lung cancer (NSCLC) is a critical part of understanding disease prognosis, treatment courses, trial eligibility, and population-level cancer surveillance. However, metastasis status is inconsistently recorded in structured cancer registry fields, since manual abstraction of clinical notes is often a resource-intensive and error-prone process. This challenge highlights an opportunity for leveraging large language models (LLMs) to conduct high-scale metastasis extraction from real-world clinical documentation. Objective: We conducted a retrospective, multi-cohort comparative evaluation of three distinct LLMs for two independent classification tasks: overall metastasis presence at any site and brain/CNS metastasis presence. We evaluated model performance on two independent NSCLC cohorts: (1) a registry-linked cohort used for model development and validation and (2) an independent cohort with manual note-level annotations for additional validation. We further explored whether our methods could analyze clinical documentation and recover missing or outdated metastasis information in structured registry labels. Methods: Patient cohorts were derived from the Winship Cancer Institute. Cohort 1 (n=579 patients; 24,887 notes across 69 note types; 2023-2025) used registry-linked metastasis fields as the reference standard. Cohort 2 (n=22 patients; 644 radiology notes; 2010-2021) was drawn from two completed randomized trials and used dual-annotator manual labels (Cohen's κ: 0.93 overall metastasis, 0.88 CNS metastasis) as the reference standard. We fine-tuned the GatorTron-base encoder model separately for each independent binary classification task. We evaluated MedGemma-27B-text and Llama 3.1-70B using zero-shot prompting.
A separate cohort of 675 patients with missing or unknown registry labels was used for an exploratory missingness-recovery analysis, validated against manual annotations of a random subsample. Results: More than half (54%) of initially identified Cohort 1 patients had missing or unknown registry metastasis labels. For overall metastasis classification, fine-tuned MedGemma demonstrated the best performance (Cohort 1: F1=0.80, Cohort 2 patient level: F1=1.0, Cohort 2 note level: F1=0.93). For brain/CNS metastasis, Llama3 performed best in both cohorts (Cohort 1: F1=0.79, Cohort 2 patient-level: F1=0.93, Cohort 2 note-level: F1=0.86). The fine-tuned GatorTron model showed strong performance for classification of overall metastasis in Cohort 1 (F1=0.72). Error analysis indicated that most model errors reflected incomplete registry labels, ambiguous clinical language, or missing documentation rather than true model errors. In the exploratory recovery analysis, model predictions agreed with manual annotations at accuracy=0.90 and F1=0.89. Conclusions: All models demonstrated relatively high performance. The zero-shot generative models were more robust to nuanced documentation and context-dependent brain/CNS metastasis extraction. The fine-tuned encoder model demonstrated strong classification performance but may have been limited by potential inaccuracies in the registry reference standards during model training. This study further demonstrated the potential of LLMs in recovering clinically plausible structured labels from narrative text, complementing cancer registries for metastasis ascertainment.
Reinosa, R.
Show abstract
IntroductionThe translation of biomarkers into binary clinical decisions requires the determination of precise cut-off points. This study validates the TholdStormDX v0.0.1 tool, a mathematical engine that employs Dual Annealing, 2- and 4-parameter logistic fitting, and vectorized Monte Carlo simulations for panel optimization under Boolean OR logic. MethodsThe tool was evaluated using datasets from four diagnostic domains (Pulmonary Nodules, Hepatocellular Carcinoma [HCC], Cervical Cancer, and Breast Cancer), along with a prognosis-oriented analytical context (Breast Cancer). Validation followed a strict workflow: characterization and selection of the best individual and combined thresholds in the Training (Train) and Validation (Val) sets, using the Test set in a completely independent manner, solely to assess the models performance and generalizability. ResultsThe tool enabled precise derivation of cut-off points for both individual biomarkers and multivariable combinations. Evaluation on the Test set objectively demonstrated in which scenarios a single biomarker outperforms a complex panel, promoting clinical parsimony. For example, in Breast Cancer diagnosis, an individual predictor outperformed the optimized panel (Sensitivity: 0.953 / Specificity: 0.952 in Test); conversely, in Hepatocellular Carcinoma, the multivariable combination showed superior performance compared to the single marker (Sens: 0.707 / Spe: 0.718 in Test). Additionally, the self-auditing system effectively flagged metric degradation when noisy variables were included, preventing potential issues. ConclusionTholdStormDX v0.0.1 proves to be a robust and transparent bioinformatics platform for deriving clinical thresholds. Its main contribution lies in mitigating local minima and promoting clinical parsimony, enabling researchers to objectively identify when a single biomarker is sufficient and when a panel provides real added value. 
Furthermore, it transforms the problem of biological noise into a safety feature: by systematically warning about algorithmic instability, it prevents overfitting and ensures the clinical viability of medical decisions. Availability: The software is free and distributed under the GNU GPLv3 license. TholdStormDX v0.0.1 is written in Python, and its source code is available at the following GitHub address: https://github.com/roberto117343/TholdStormDX. Contact: roberto117343@gmail.com
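To make the "Boolean OR logic" for biomarker panels concrete, the following is a minimal sketch of how a panel of cut-offs could be evaluated. It is not the TholdStormDX API: the function names, the marker names (`marker_a`, `marker_b`), the toy data, and the assumption that higher values indicate disease are all hypothetical choices made for illustration.

```python
def or_panel_predict(sample, thresholds):
    """Return True if ANY biomarker in the panel meets or exceeds its cut-off.

    Assumption: higher marker values indicate disease.
    """
    return any(sample[name] >= cutoff for name, cutoff in thresholds.items())


def sensitivity_specificity(samples, labels, thresholds):
    """Compute (sensitivity, specificity) of the OR panel on labelled samples."""
    tp = fn = tn = fp = 0
    for sample, is_case in zip(samples, labels):
        predicted = or_panel_predict(sample, thresholds)
        if is_case and predicted:
            tp += 1
        elif is_case:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Hypothetical toy data: three cases followed by three controls.
SAMPLES = [
    {"marker_a": 5.0, "marker_b": 1.0},
    {"marker_a": 1.0, "marker_b": 9.0},
    {"marker_a": 1.0, "marker_b": 1.0},
    {"marker_a": 1.0, "marker_b": 2.0},
    {"marker_a": 4.5, "marker_b": 0.5},
    {"marker_a": 0.0, "marker_b": 0.0},
]
LABELS = [True, True, True, False, False, False]

panel = sensitivity_specificity(SAMPLES, LABELS, {"marker_a": 4.0, "marker_b": 8.0})
single = sensitivity_specificity(SAMPLES, LABELS, {"marker_a": 4.0})
print("OR panel (sens, spec):", panel)
print("single marker (sens, spec):", single)
```

Comparing the panel against each single marker on a held-out test set, as in this toy comparison, is one way to check whether the panel adds value over the simpler rule.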
Bian, R.; Cheng, W.
The rapid development of large language models (LLMs) has stimulated growing interest in their use for medical question answering and clinical decision support. However, compared with frontier proprietary systems, the empirical understanding of lightweight open-source LLMs in medical settings remains limited, particularly under resource-constrained experimental conditions. To address this gap, we introduce MedScope, a lightweight benchmarking framework for systematically evaluating open-source LLMs on medical multiple-choice question answering. Using 1,000 sampled questions from MedMCQA, we benchmark six lightweight open-source models spanning three representative model families: LLaMA, Qwen, and Gemma. Beyond standard predictive metrics such as accuracy and macro-F1, our framework additionally considers inference time, prediction consistency, subject-wise variability, and model-specific error patterns. We further develop a set of multi-perspective visual analyses, including clustered heatmaps, agreement matrices, Pareto-style trade-off plots, radar charts, and multi-panel summary figures, in order to characterize model behavior in a more interpretable and comprehensive manner. Our results reveal substantial heterogeneity across models in predictive performance, efficiency, and subject-level robustness. While larger lightweight models generally achieve better overall results, the gain is neither uniform across subject categories nor always aligned with efficiency. These findings suggest that lightweight open-source LLMs remain valuable as transparent and reproducible medical AI baselines, but their current capabilities are still insufficient for unsupervised deployment in high-risk healthcare scenarios. MedScope provides an accessible benchmark for evaluating lightweight medical LLMs and emphasizes the need for multi-dimensional assessment beyond accuracy alone. The relevant code is now open-sourced at: https://github.com/VhoCheng/MedScope.
Adegbosin, O. T.; Patel, H.
Background: Microsatellite stability status determination is important for prognostication and therapeutic decision making in colorectal cancer management, but the conventional methods for this assessment are not readily available, especially in low- and middle-income countries. Deep learning (DL) models have been proposed for addressing this problem; however, potential computational cost due to model complexity and inadequate explainability may limit their adoption in low-resource settings. This study explored the potential of explainable lightweight models for detection of microsatellite instability in colorectal cancer. Methods: DL models were trained using a public dataset of colorectal cancer histology images and then used to classify a set of test images into one of two classes: microsatellite instability or microsatellite stability. The models were compared for efficiency. Gradient-weighted class activation mapping (Grad-CAM) was used to interpret the models' decision making. Results: The simpler convolutional neural network (CNN) trained from scratch had modest performance (accuracy=0.757, area under receiver-operating characteristic curve [AUROC]=0.840). With an attention mechanism added, these values increased, but specificity and sensitivity decreased. Pretrained models performed better than the ones trained from scratch, and EfficientNet_B0 had the best balance of high performance and low computational requirements (accuracy=0.936, AUROC=0.990, negative predictive value=0.923, specificity=0.953, 4,010,000 trainable parameters, 0.38 gigaFLOPs). However, a simple CNN model with an attention mechanism had the best interpretability based on Grad-CAM. Conclusion: This study demonstrated that DL models that are lightweight compared to previously proposed ones can be useful for colorectal cancer microsatellite instability screening in resource-limited settings while balancing performance and computational efficiency.
Matthewman, J.; Denaxas, S.; Langan, S.; Painter, J. L.; Bate, A.
Objectives: Large language models (LLMs) have shown promise in creating clinical codelists for research purposes, a time-consuming task requiring expert domain knowledge. Here, we evaluate the performance and assess failure modes of a retrieval augmented generation (RAG) approach to creating clinical codelists for the large and complex medical terminology used by the Clinical Practice Research Datalink (CPRD). Materials & Methods: We set up a RAG system using a database of word embeddings of the medical terminology that we created using a general-purpose word embedding model (gemini-embedding). We developed 7 reference codelists presenting different challenges and tagged required and optional codes. We ran 168 evaluations (7 codelists, 2 different database subsets, 4 models, 3 epochs each). Scoring was based on the omission of required codes and the inclusion of irrelevant codes. We used model-grading (i.e., grading by another LLM with the reference codelists provided as context) to evaluate the output codelists (a score of 0% being all incorrect and 100% being all correct). Results: We saw varying accuracy across models and codelists, with Gemini 3 Pro (score 43%) generally performing better than Claude Sonnet 4.6 (36%) and Gemini 3 Flash, and OpenAI GPT 5.2 performing worst (14%). Models performed better with shorter target codelists (e.g., Eosinophilic esophagitis with four codes, and Hidradenitis suppurativa with 14 codes). By contrast, all models consistently failed to produce a complete Wrist fracture codelist (with 214 required codes). We further present evaluation summaries and failure mode evaluations produced by parsing LLM chat logs. Discussion: Besides demonstrating that a single-shot RAG approach is currently not suitable for codelist generation, we demonstrate failure modes including hallucinations, retrieval failures, and generation failures where retrieved codes are not used. 
Conclusions: Our findings suggest that while RAG systems using current frontier LLMs may create correct clinical codelists in some cases, they still struggle with large and complex terminologies and codelists with a large number of codes. The failure modes we highlight can inform the creation of future workflows to avoid failures.
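The two scoring criteria named in the abstract (omission of required codes, inclusion of irrelevant codes) can be illustrated with a simple set comparison. This is only a deterministic sketch of those criteria, not the study's grading procedure, which used another LLM as grader; the function name, the returned fields, and the code identifiers below are hypothetical.

```python
def score_codelist(output_codes, required_codes, optional_codes):
    """Compare a generated codelist with a reference tagged as required/optional.

    Optional codes are neither rewarded nor penalised in this sketch.
    """
    output = set(output_codes)
    required = set(required_codes)
    optional = set(optional_codes)
    missing_required = required - output        # omission errors
    irrelevant = output - required - optional   # inclusion errors
    recall = 1.0 if not required else len(required & output) / len(required)
    return {
        "missing_required": sorted(missing_required),
        "irrelevant": sorted(irrelevant),
        "required_recall": recall,
    }


# Hypothetical identifiers, not real CPRD terminology entries.
result = score_codelist(
    output_codes=["C001", "C002", "C900"],
    required_codes=["C001", "C002", "C003", "C004"],
    optional_codes=["C010"],
)
print(result)
```

Under a rule like this, a long reference list such as the 214-code Wrist fracture codelist leaves many more opportunities for omission than a four-code list, which is consistent with the pattern the abstract reports.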
Hakata, Y.; Oikawa, M.; Fujisawa, S.
Background: Adult diffuse glioma is a representative class of primary brain tumors for which accurate MRI-based tumor segmentation is indispensable for treatment planning. Conventional automated segmentation methods have relied primarily on image information and spatial prompts, and auxiliary clinical information that is routinely acquired in clinical practice has not been sufficiently exploited as an input. Objective: Building on a dual-prompt-driven Segment Anything Model (SAM) extension framework [20] that fuses visual and language reference prompts, we propose a method that integrates patient demographics, unsupervised molecular cluster variables derived from TCGA high-throughput profiling, and histopathological parameters as learnable prompt embeddings, and we evaluate its effect on the accuracy of lower-grade glioma (LGG) MRI segmentation. Methods: An auxiliary prompt encoder converts clinical metadata into high-dimensional embeddings that are fused with the prompt representations of SAM ViT-B through a cross-attention fusion mechanism. The TCGA-LGG MRI Segmentation dataset (Kaggle release by Buda et al. [24]; n = 110 patients; WHO grade II-III) was split at the patient level (train/val/test = 71/17/22) using three different random seeds, and the three slices with the largest tumor area were extracted from each patient. To avoid pseudo-replication arising from multiple slices per patient and repeated measurements across seeds, our primary analysis aggregated Dice and 95th-percentile Hausdorff distance (HD95) to the patient × seed unit (n = 66); secondary analyses at the unique-patient level (n = 22) and at the per-slice level (n = 198) are also reported. Pairwise comparisons used paired t-tests with Bonferroni correction (k = 3) and Wilcoxon signed-rank tests, and a permutation test (K = 30) served as an auxiliary check of effective use of the auxiliary information. 
Results: At the patient × seed level (n = 66), Proposed (full clinical) achieved a Dice gain of Δ = +0.287 over the zero-shot SAM ViT-B baseline (paired-t p = 4.2 x 10-{superscript 1}, Cohen's d_z = +1.25, Bonferroni-corrected p << 0.001; Wilcoxon p = 2.0 x 10-{superscript 1}), and HD95 improved from 218.2 to 64.6. Because zero-shot SAM is not designed for domain-specific medical segmentation, the large absolute HD95 gap largely reflects the expected domain gap rather than a competitive baseline. The additional contribution of the full clinical configuration over the demographics-only configuration was Δ Dice = +0.023 (paired-t p = 0.057, Bonferroni-corrected p = 0.172), which did not reach statistical significance at the patient level and is reported as a directional trend. The permutation test (K = 30, seed 2025) yielded real-metadata Dice = 0.819 versus a shuffled-metadata mean of 0.773, giving an empirical p = 0.032 = 1/(K + 1), which is at the resolution limit of this test and should therefore be interpreted as preliminary evidence. Conclusions: Integrating auxiliary clinical information as multimodal prompts produced a large improvement over the zero-shot SAM baseline on this LGG cohort. More importantly, a robustness analysis showed that Proposed (full clinical) outperformed the trained Base (no auxiliary information) under all tested spatial-prompt conditions, including perfect centroid (Δ = +0.014), and that the advantage was most pronounced in the prompt-free regime (Δ = +0.231, p = 0.039), where the base model collapsed but the proposed model maintained meaningful segmentation by leveraging clinical metadata alone. The additional contribution of molecular and histopathological information beyond demographics was not statistically resolved at the patient level (Δ = +0.023, n.s.). 
Establishing clinical utility will require external validation on larger multi-center cohorts and direct comparisons with established segmentation methods.
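The remark that p = 0.032 = 1/(K + 1) sits at the resolution limit of the permutation test can be checked with a short sketch. It assumes the common "+1" estimator for a one-sided empirical p-value; the abstract does not report the individual shuffled-metadata scores, so the values below are hypothetical placeholders centred near the reported mean of 0.773.

```python
def permutation_p_value(observed, shuffled_scores):
    """One-sided empirical p-value: (1 + #{shuffled >= observed}) / (K + 1)."""
    k = len(shuffled_scores)
    at_least_as_good = sum(1 for score in shuffled_scores if score >= observed)
    return (1 + at_least_as_good) / (k + 1)


K = 30
observed_dice = 0.819  # real-metadata Dice reported in the abstract

# Hypothetical shuffled-metadata Dice values (not from the paper),
# all of which fall below the observed score.
shuffled_dice = [0.773 + 0.001 * (i - 15) for i in range(K)]

p = permutation_p_value(observed_dice, shuffled_dice)
print(f"empirical p = {p:.3f}")  # smallest value attainable with K = 30
```

With K = 30 no outcome can produce a p-value below 1/31, so a larger number of permutations would be needed to resolve a smaller p-value.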
Hakata, Y.; Oikawa, M.; Fujisawa, S.
Background: Federated learning (FL) enables collaborative model training across institutions without sharing patient-level data. However, standard FL algorithms such as FedAvg degrade under non-independent and identically distributed (non-IID) data, a prevalent condition when patient demographics, scanner hardware, and disease prevalence differ across hospital sites. Objective: We propose iPS-MFFL (Individualized Per-Site Meta-Federated Feature Learning), a federated framework with a hierarchical local-model architecture that addresses non-IID heterogeneity through (1) a shared feature extractor, (2) multiple weak-learner classification heads that can be trained with heterogeneous training objectives to promote complementary decision boundaries, (3) independent per-learner server aggregation so that each weak learner's parameters are averaged only with its counterparts at other clients, and (4) a lightweight meta-model -- itself federated -- that adaptively stacks the weak-learner outputs. The specific choices of backbone, weak-learner training objectives, and meta-model are implementation details; in this work we use an ImageNet-pretrained ResNet18 and three heterogeneous losses as a concrete instantiation. Methods: We evaluate on the Brain Tumor MRI Classification dataset (7,200 images; 4 classes: glioma, meningioma, pituitary tumor, no tumor) partitioned across K = 5 simulated hospital sites using Dirichlet non-IID sampling (concentration parameter = 0.3). Four baselines are compared: Local-only training, FedAvg, FedProx, and Freeze-FT. All experiments are repeated over three random seeds (13, 42, 2025) and evaluated using paired t-tests, Cohen's d effect sizes, and post-hoc power analysis. Results: iPS-MFFL achieved the highest mean final-round test accuracy point estimate of 85.42 ± 8.70% (mean ± SD across three seeds), compared to FedAvg (78.48 ± 12.66%), FedProx (78.33 ± 14.64%), Freeze-FT (73.98 ± 21.09%), and Local (58.10 ± 11.77%). 
iPS-MFFL showed the smallest cross-seed SD, suggesting greater robustness to partition heterogeneity. However, one-way ANOVA did not reach statistical significance (F = 1.52, p = 0.270), reflecting the limited number of experimental seeds. Cohen's d effect sizes relative to iPS-MFFL ranged from 0.59 (vs. FedProx) to 2.64 (vs. Local); post-hoc pairwise comparisons are reported as exploratory given the non-significant omnibus test. Post-hoc power analysis indicated that statistical power for FL baseline comparisons was only 0.10-0.12 for the observed effect sizes (d ≈ 0.6) at n = 3 seeds. Conclusions: iPS-MFFL provides a practical approach to heterogeneous federated brain tumor classification by combining transfer learning, contrastive weak-learner diversity, and meta-learning. The framework demonstrated the highest mean accuracy and lowest variance across diverse data partitions. Validation with larger seed pools (≥ 10 seeds for 80% power), ablation studies, and external multi-center cohorts is needed to establish generality.
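Dirichlet non-IID sampling is commonly implemented as label-skew partitioning: for each class, site proportions are drawn from a symmetric Dirichlet distribution and that class's samples are split accordingly. The sketch below follows that common recipe using only the Python standard library; it is not the authors' code, and the function names, the equal class sizes, and the rounding scheme are assumptions made for illustration.

```python
import random


def dirichlet_sample(alpha, k, rng):
    """Draw from a symmetric Dirichlet(alpha) over k parts via normalised Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_partition(labels, num_sites, alpha, seed):
    """Assign sample indices to sites with class-wise Dirichlet proportions."""
    rng = random.Random(seed)
    sites = [[] for _ in range(num_sites)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        proportions = dirichlet_sample(alpha, num_sites, rng)
        start, cumulative = 0, 0.0
        for site in range(num_sites):
            cumulative += proportions[site]
            # The last site takes the remainder so that no sample is dropped.
            if site == num_sites - 1:
                end = len(indices)
            else:
                end = min(len(indices), round(cumulative * len(indices)))
            sites[site].extend(indices[start:end])
            start = end
    return sites


# Assumption: 7,200 images spread evenly over 4 classes (class sizes are not
# given in the abstract); K = 5 sites, concentration 0.3, and seed 13 as reported.
labels = [i % 4 for i in range(7200)]
sites = dirichlet_partition(labels, num_sites=5, alpha=0.3, seed=13)
for site_id, members in enumerate(sites):
    counts = [sum(1 for i in members if labels[i] == c) for c in range(4)]
    print(f"site {site_id}: n={len(members)}, per-class counts={counts}")
```

Smaller concentration values concentrate each class on fewer sites, which is the kind of heterogeneity under which FedAvg is reported to degrade.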